Capstone: House Price Prediction

By: David Blake Hudson

Problem Statement:

A house's value is more than just location and square footage. Like the features that make up a person, an informed party would want to know all the aspects that give a house its value. For example, suppose you want to sell a house but do not know what price to expect; it cannot be too low or too high. To estimate the price, you would typically look for similar properties in your neighborhood and, based on the data gathered, assess your own house's value.

The objective of this problem statement is to predict the housing prices of a town, or a suburb, based on the features of the locality provided and identify the most important features to consider while predicting the prices.

Data Dictionary:

This dataset has 23 features, price being the target variable. The details of all the features are given below:

Observations for further research

Observations on high bedroom values

Observations on low bedroom and bathroom counts

The code below looks for rows with the same number of missing observations.

Zipcodes

There are too many unique zipcode values to get meaningful EDA, and they may cause overfitting. I group zipcodes by their first digits, but should it be the first two, three, or four? The code below shows the optimum is the first FOUR digits: two or three digits leave too few groups, while five digits just reproduces all of the unique zipcodes.
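That comparison can be sketched as below; the column name and the sample zipcode values here are invented for illustration, not taken from the actual data.

```python
import pandas as pd

# Hypothetical sample of zipcodes; the real notebook would use df['zipcode']
zips = pd.Series([98101, 98102, 98103, 98004, 98005, 98033, 98199])

# Count how many groups each prefix length would produce
for n in range(2, 6):
    n_groups = zips.astype(str).str[:n].nunique()
    print(f"first {n} digits -> {n_groups} groups")
```

With real King County data, two- and three-digit prefixes collapse nearly everything into one or two groups, which is what motivates the four-digit choice.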

There are some variables that I intend to use as integers in the model but that I am going to group with those I intend to use as categorical dummy values, so I can transform them into category data types in a separate df called 'df2' to be used during EDA. The reason is that I am more interested in seeing the mode for things like the number of bedrooms than in seeing the average number of bedrooms.
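A minimal sketch of that EDA-only copy, assuming a `room_bed` column; the sample frame is invented.

```python
import pandas as pd

# Hypothetical slice of the housing data
df = pd.DataFrame({"room_bed": [3, 4, 3],
                   "price": [450000.0, 610000.0, 380000.0]})

# df2 is an EDA-only copy where count-like columns become categories,
# so describe() reports the mode/frequency instead of a mean of bedrooms
df2 = df.copy()
df2["room_bed"] = df2["room_bed"].astype("category")
```

The original `df` keeps its integer dtype for modeling; only the copy is recast.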

Univariate Analysis

Observations

Well, the house has 6 bedrooms, 8 bathrooms, a condition of 4, and a quality of 13. It appears to simply be a mansion.

Observations

Price

Observations

Living Measure

Observations

Lot Measure

Observations

Ceil Measure

Observations

Basement

Observations

Living Measure 15

Observations

Lot Measure 15

Observations

Total Area

Observations

Bedrooms

Observations

Bathrooms

Observations

Year Built

Observations

I feel that data at this granular a level is too chaotic, so I am creating a new variable that represents the decade in which the house was built.
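The derivation can be sketched as below; the column name `yr_built` and the sample years are assumptions. The same floor-to-decade trick applies to the renovation year later on.

```python
import pandas as pd

# Hypothetical build years standing in for df['yr_built']
df = pd.DataFrame({"yr_built": [1987, 1990, 2004, 1955]})

# Floor each year to the start of its decade
df["decade_built"] = (df["yr_built"] // 10) * 10
```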

Decade Built

Observations

Year Renovated

Observations

Again, I am going to create another decade variable, this time for renovations.

Decade Renovated

Observations

Ceil

Observations

Coast

Observations

Sight

Observations

Condition

Observations

Zipcode

Zipcode2

Observations

Quality

Observations

Furnished

Observations

Year House was Sold

Observations

Months House was Sold

Observations

Multivariate Analysis

Observations

Pairplot

Observations

Pairplot broken down by Zipcode2

Observations

Pairplot broken down by condition

Observations

Pairplot broken down by furnished

Observations

Pairplot broken down by bathrooms

Observations

Pairplot broken down by ceil

Observations

Pairplot broken down by coast

Observations

Pairplot broken down by year sold

Observations

Pairplot broken down by month sold

Observations

Pairplot broken down by decade built

Observations

Pairplot broken down by quality

Observations

Cramer's V for Categorical Variables

Observations

Multivariate Analysis

Continuous Variables Over Time

Count of houses per month

This chart shows the distribution of houses sold over time, to help interpret the variance in the following graphs.

Observations

Price over time

Observations

Living Measure over time

Observations

Lot Measure over time

Observations

Ceil Measure over time

Observations

Basement over time

Observations

Living Measure15 over time

Observations

Lot Measure15 over time

Observations

Total Area over time

Observations

Price by Bedroom

Observations

Price by Bathroom

Observations

Price by Year House was Built

Observations

Price by Year House was Renovated

Observations

Price by Ceil

Observations

Price by Coast

Observations

Price by Sight

Observations

Price by Condition

Observations

Price by Zipcode

Observations

Price by Zipcode2

Observations

Price by Quality

Observations

Furnished

Observations

Month that House was Sold

Observations

Year that House was Sold

Observations

EDA Conclusion

None of the continuous variables were normally distributed, which will cause issues for the linear regression model unless log transformations are applied. Living measure, ceil measure, and living measure 15 look like the best continuous variables for predicting price. Furnished, zipcode, quality, condition, ceil, yr_renovated, room_bath, and room_bed appear to be important categories when predicting price. Total_area will need to be dropped since it is perfectly correlated with lot_measure.

Data Pre-Processing

I am dropping total_area since it is the sum of living measure and lot measure and has perfect collinearity with lot measure. I am dropping zipcode, yr_built, and yr_renovated since we have alternative variables that are aggregated versions of these, which helps reduce the dimensionality of our data and makes our model more general.

I am transforming all of the categorical variables to category data types, but I am transforming sight and room_bed to integer data types because I believe they are better used as ordinal variables in the regression than as a dummy variable for each value. However, I have to wait until I impute missing values before I can convert these variables to integers.

Log transformations and outlier detection

First, the continuous variables are plotted one more time and then plotted again after they have undergone the log transformation. Prior to log transformation, extreme values are analyzed to detect any values that appear to be unreasonable outliers that need to be treated.
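The transformation step can be sketched as follows; the sample prices are invented, and the real notebook applies this to each continuous column.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices
price = pd.Series([250000.0, 400000.0, 650000.0, 1200000.0, 5000000.0])

# A natural-log transform pulls in the long right tail
log_price = np.log(price)
```

The transform is invertible with `np.exp`, so predictions made on the log scale can be mapped back to dollars.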

Observations

Since I can rationalize the extreme outliers, I will treat them as values that genuinely represent the process for their respective variables and therefore not cap them. I will only perform log transformations on these variables and check the results.

Observations

Creating binned Basement Category variable

I am binning basement into three categories: houses with no basement, houses whose basement size is at or below the 50th percentile among houses that have a basement, and houses whose basement is larger than the 50th percentile. The categories are labeled No Basement, Small Basement, and Large Basement.
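A minimal sketch of that binning, with invented basement sizes (0 meaning no basement):

```python
import numpy as np
import pandas as pd

# Hypothetical basement sizes; 0 means no basement
basement = pd.Series([0, 0, 300, 600, 900, 1200])

# Median size among houses that actually have a basement
median_bsmt = basement[basement > 0].median()

basement_cat = np.where(
    basement == 0, "No Basement",
    np.where(basement <= median_bsmt, "Small Basement", "Large Basement"),
)
```

Computing the median only over nonzero values matters: otherwise the many zero-basement houses would drag the cut point down.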

Imputing missing variables

Splitting the data into training and testing sets prior to imputation, to avoid leaking test-set information into the training data and biasing our model.
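The order of operations can be sketched as below; the frame and column names are invented, and the imputation strategy shown is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical frame with missing bedroom counts
df = pd.DataFrame({"room_bed": [3, np.nan, 4, 3, np.nan, 2],
                   "price": [3.0, 4.0, 5.0, 3.5, 4.5, 2.5]})

# Split FIRST, before any imputation
X_train, X_test = train_test_split(df[["room_bed"]], test_size=0.33,
                                   random_state=1)

# Fit the imputer on the training fold only, then apply it to both folds,
# so no statistic computed from the test set leaks into training
imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
```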

Exploratory Data Analysis for new variables

Analyzing the new variables created.

Univariate Analysis

Basement Categories

Using the data from before the split and imputation for this one, since these categories became their own dummy variables.

Observations

The rest of the variables were created after splitting and imputing the data. So for exploratory purposes, I rejoined the train and test data in another df and then added back in log_price so that we could get an overall picture.

Log Price

Observations

Log Living Measure

Observations

Log Living Measure15

Observations

Log Lot Measure

Observations

Log Lot Measure15

Observations

Log Ceil Measure

Observations

Log Basement

Observations

Data Processing EDA Conclusion

Except for log basement, all continuous variables are now approximately normally distributed after log transformation. Basement has been binned into 3 categories in case basement's skewed distribution is the result of a mixture of Gaussians. Interaction terms have been created for combinations of furnished with log_ceil_measure and both log_living_measures, and for combinations of ceil with the different basement categories. I have kept the untransformed originals of the transformed variables to use with the regression-tree algorithms, which do not require the target variable and predictors to be normally distributed.

Supervised Learning Models

The following block of code creates a function, 'checking_vif', to measure the VIF score; the next block creates the function 'get_model_score', which reports multiple metrics (RMSE, MAPE, MAE, Economic Cost) to help us evaluate the different models.

Along with the usual metrics for evaluating the models, an additional metric, Economic Cost, is added to give another perspective on the model. If we assume that a house does not sell when it is overpriced, we can treat that as the opportunity cost of not selling the home. Additionally, an underpriced home will sell at a discount, and we can treat the lost revenue between what the house could have sold for and what it did sell for as an opportunity cost as well. Economic Cost is defined as the mean, across all houses, of the sale price for each house our model overpriced and the residual for each house we underpriced. This metric is biased against overpricing, since it gives greater weight to a house we overpriced than to one we underpriced.
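One way to read that definition is sketched below; treating 'sell price' as the actual price is my assumption, since the notebook's exact implementation is not shown.

```python
import numpy as np

def economic_cost(y_true, y_pred):
    """Mean opportunity cost: the full price for each house the model
    overpriced (assumed not to sell) plus the shortfall (residual) for
    each house it underpriced."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    overpriced = y_pred > y_true
    # Overpriced: lose the whole potential sale; underpriced: lose the gap
    cost = np.where(overpriced, y_true, y_true - y_pred)
    return cost.mean()
```

Note the asymmetry: overpricing a $100k house by $10 costs the full $100k here, while underpricing it by $10 costs only $10, which is the bias against overpricing described above.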

Datasets

The code below creates a separate dataset for each set of algorithms. For the linear regression, we include the log-transformed versions of the variables and the interaction terms. Ridge and Lasso get a normalized version of this dataset. XGBoost already accounts for interactions and does not depend on the normality assumption, so it gets the dataset without log transformations or interaction terms.

Naive Conditional Mean

The first model, run as a benchmark, is the mean of price conditioned on the variable zipcode2. I chose this variable given how important location is known to be to real estate prices. This model serves as a naive estimate, and its output will serve as an initial benchmark against which to compare the rest of the models.
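A minimal sketch of the conditional-mean benchmark; the frame and values are invented for illustration.

```python
import pandas as pd

# Hypothetical training data; zipcode2 is the grouped-zipcode variable
train = pd.DataFrame({"zipcode2": ["9810", "9810", "9800"],
                      "price": [500000.0, 700000.0, 400000.0]})

# "Model": the average price within each zipcode2 group
group_means = train.groupby("zipcode2")["price"].mean()

# A new house is predicted at its group's mean
test_zips = pd.Series(["9810", "9800"])
preds = test_zips.map(group_means)
```

Any real model should beat this, since it uses only one feature and no fitting beyond averaging.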

Observations:

OLS Linear Regression

The linear regression model attempts to find a linear relationship (line, plane, etc.) that best represents the relationship between the group of explanatory variables and the target variable (housing price in our case). To do so, we estimate an OLS model on the full set of explanatory variables. Then we weed out the variables with high VIF scores and/or statistically insignificant estimated coefficients.

Observations

Observations

Observations

Observations

Observations

Observations

To make the model more complete, I am combining the ceil_3.0 and ceil_3.5 variables and naming their replacement column ceil>=3.0. I then replace the flawed interaction terms with new interaction terms that combine the basement categories with the new ceil column, and rerun the model.
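The dummy merge can be sketched as below; the frame here is invented, keeping only the two affected columns.

```python
import pandas as pd

# Hypothetical dummy columns for number of floors
df = pd.DataFrame({"ceil_3.0": [1, 0, 0], "ceil_3.5": [0, 1, 0]})

# Merge the two sparse dummies into a single ceil>=3.0 indicator
df["ceil>=3.0"] = (df["ceil_3.0"] | df["ceil_3.5"]).astype(int)
df = df.drop(columns=["ceil_3.0", "ceil_3.5"])
```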

Observations

We must check that the OLS assumptions are satisfied before walking away from the OLS regression.

Residual mean should be 0

The residual mean is approximately 0.

Homoskedasticity

Test for heteroskedasticity using the Goldfeld-Quandt test.

H0 : Residuals are homoscedastic

HA : Residuals are heteroscedastic

alpha = 0.05

We fail to reject the null hypothesis, meaning that the model does not suffer from heteroskedasticity.

Linearity

Below the residuals are plotted against the fitted values and followed by a distribution plot of the residuals.

The residuals cluster into a mostly spherical cloud, as desired. There are a few outliers in the tails that pull the mean away from 0, especially at the lower fitted values. But I already checked the extreme values in the explanatory variables, and there was no reason to believe those values fell outside the natural processes for those variables. Overall, I think this is a reasonable residual plot. The residuals are normally distributed in the distribution plot on the side.

Normality

Observations

Predicted vs Observed

Observations

Conclusion

The OLS linear model satisfies all of the assumptions.

Ridge Regression

Ridge regression is a regularization algorithm that reduces the chance of overfitting by biasing the model, applying the L2 penalty to the original model. The result is a model whose coefficients are shrunk toward zero, but not exactly to zero, trading higher bias for lower variance. An important component of the L2 penalty is the scalar term alpha. What is the best value for alpha? GridSearchCV is applied to determine it.
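A minimal sketch of that alpha search; the data here is synthetic, standing in for the normalized housing features, and the alpha grid is an assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic features standing in for the normalized housing predictors
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=100)

# Cross-validated search over candidate alpha values
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

The same pattern works for Lasso by swapping the estimator; only the penalty (L1 instead of L2) differs.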

Observations

Residuals should have mean of 0

The residual mean is approximately 0.

Homoskedasticity

The test fails to reject the null hypothesis that the errors are homoskedastic, so there is no evidence of heteroskedasticity.

Linearity

The residual cloud looks as desired, with a normally distributed residual term around 0 in the side plot.

Normality

Observations

Predicted vs Observed

Observations

Lasso Regression

Observations

Residuals should have a mean of 0

The residual mean is approximately 0.

Homoskedasticity

The test fails to reject the null hypothesis that the errors are homoskedastic, so there is no evidence of heteroskedasticity.

Linearity

The residual cloud looks as desired in the main plot, although it could be more circular. On the side plot, we see an approximately normally distributed residual term around 0.

Normality

Observations

Observed vs Predicted

Observations

Random Forest

Observations

AdaBoost

Observations

GradientBoost

Observations

Observed vs Predicted

Observations

XGBoost

Observations

Observations

Model Comparison

Observations

Feature Importance

Observations

Insights and Recommendations:

Comments Future Analysis

I also tried using K-Means clustering to find groups based on location in place of zipcodes; this model performed slightly worse. Another possibility is to get GPS coordinates for local schools, hospitals, fire departments, etc., and calculate each house's distance to the nearest location of each type of institution. If that fails, I may have to use a separate model to estimate the houses with the greatest residuals.
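For the distance idea, a haversine helper would be a natural starting point; this is a sketch of that future step, not code from the notebook.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two GPS coordinates."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Each house's feature would then be the minimum of this distance over all institutions of a given type.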

Conclusion

The XGBoost model is the best model for predicting prices in King County, Washington, with a test R-squared of 0.82 and an RMSE of \$161,215. Higher house prices were associated with being furnished, being in the northern part of the county, larger living areas, higher quality scores, a coastal view, more views, and being outside areas with medium average prices. These features can help sellers act to increase their house prices and let buyers know what to look for when seeking lower prices.